FishEye International, a non-profit organization dedicated to combatting the scourge of illegal, unreported and unregulated (IUU) fishing, has been granted access to fishing-related companies’ financial database, offered by an international finance corporation. Through their previous investigations, FishEye has discovered that companies with unusual arrangements are more likely to be engaged in IUU activities or other questionable practices. To leverage this valuable resource, FishEye has transformed the database into a comprehensive knowledge graph, which encompasses data on companies, owners, workers and revenue.
The primary objective of our project is to employ this graph to detect irregularities that may indicate a company’s involvement in IUU fishing.
Dataset contains the knowledge graph with 27,622 nodes and 24,038 edges:
Node Attributes:
type – Type of node as defined above.
country – Country associated with the entity. This can be a full country or a two-letter country code.
product_services – Description of product services that the “id” node does.
revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.
id – Identifier of the node is also the name of the entry.
role – The subset of the “type” node, not in every node attribute.
Edge Attributes:
type – Type of the edge as defined above.
source – ID of the source node.
target – ID of the target node.
dataset – Always “MC3”.
role - The subset of the “type” node, not in every edge attribute.
We use as_tibble() to extract node data into a tibble.
Show code
mc3_nodes <-as_tibble(mc3_data$nodes)
We wrangle the node data:
mutate() and as.character() are used to convert the field data type from list to character.
To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.
select() is used to re-organise the order of the fields.
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
Show code
skim(mc3_edges)
Data summary
Name
mc3_edges
Number of rows
24036
Number of columns
4
_______________________
Column type frequency:
character
3
numeric
1
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
source
0
1
6
700
0
12856
0
target
0
1
6
28
0
21265
0
type
0
1
16
16
0
2
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
weights
0
1
1
0
1
1
1
1
1
▁▁▇▁▁
The report above reveals that there is no missing values in all fields.
datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
Show code
DT::datatable(mc3_edges)
We use ggplot to plot the frequency of different types of relationship in the edges. We have two types: beneficial owner and company contacts.
Show code
type_freq<-ggplot(mc3_edges, aes(x=type))+geom_bar()+theme_minimal()+ggtitle('Frequency Count by Relationship Type')ggplotly(type_freq)
We take a look at the number of companies under each owner. Most owners have only 1 company. We will extract the graph for those owners with more than 3 companies under them. There are 67 such owners. We will look at the network of these owners.
Show code
owner_companycnt<-mc3_edges %>%filter(type=='Beneficial Owner') %>%group_by(target) %>%summarise(company_count=n()) %>%arrange(desc(company_count)) %>%ggplot(aes(x=company_count)) +geom_bar()+scale_x_continuous(breaks=c(1:10))+theme_minimal()+ggtitle('Frequency Count by Number of Companies under each owner')ggplotly(owner_companycnt)
Show code
# Extract onwers with more than 3 companiesowners<-mc3_edges %>%filter(type=='Beneficial Owner') %>%group_by(target) %>%summarise(company_count=n()) %>%filter(company_count>3)
Node data
Use skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.
Show code
skim(mc3_nodes)
Data summary
Name
mc3_nodes
Number of rows
27622
Number of columns
5
_______________________
Column type frequency:
character
4
numeric
1
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
id
0
1
6
64
0
22929
0
country
0
1
2
15
0
100
0
type
0
1
7
16
0
3
0
product_services
0
1
4
1737
0
3244
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
revenue_omu
21515
0.22
1822155
18184433
3652.23
7676.36
16210.68
48327.66
310612303
▇▁▁▁▁
There are missing values in revenue_omu but no missing values in other variables: id, country, type, product_services
datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.
Show code
DT::datatable(mc3_nodes)
Plot the frequency of type of nodes using geom_bar()
cp_plot<-mc3_nodes %>%filter(type=='Company') %>%group_by(id) %>%mutate(country_count=n_distinct(country)) %>%arrange(desc(country_count)) %>%ggplot(aes(x=country_count))+geom_bar()+scale_x_continuous(breaks=c(1:10))+theme_minimal()+ggtitle('Frequency Count by Number of Countries of each company')ggplotly(cp_plot)
We will save companies that operate in multiple countries in cp_multicount
Let’s also take a look at the distribution of different countries for the companies.
Show code
mc3_nodes %>%filter(type=='Company') %>%select(id, country) %>%distinct() %>%group_by(country) %>%mutate(cnt=n()) %>%ggplot(aes(x=reorder(country, cnt)))+geom_bar()+theme_minimal()+ggtitle('Frequency Count by Countries of Company')+theme(axis.text.x =element_text(angle =90, vjust =0.5, hjust=1))
Identify shell companies from ownership network graph
Owners with high number of companies
Let’s take a look at the network of owners with high number of companies under them. We want to check if these companies under these owners are also co-owned by other owners. If a owner owns many companies that are not co-owned by others, these are likely shell companies.
# A tbl_graph: 362 nodes and 313 edges
#
# An unrooted forest with 49 trees
#
# A tibble: 362 × 4
name type betweenness_centrality degree_centrality
<chr> <chr> <dbl> <dbl>
1 Acevedo, Dickson and Gonzalez Company 0 1
2 Adams Group Company 0 1
3 Adams-Pope Company 0 1
4 Adriatic Catch S.A. de C.V. Company 0 1
5 Albertine Rift NV Family Company 0 1
6 Alexander PLC Company 0 1
# ℹ 356 more rows
#
# A tibble: 313 × 4
from to type weights
<int> <int> <chr> <int>
1 1 296 Beneficial Owner 1
2 2 297 Beneficial Owner 1
3 3 298 Beneficial Owner 1
# ℹ 310 more rows
# A tibble: 64,202 × 5
id country type revenue_omu word
<chr> <chr> <chr> <dbl> <chr>
1 Jones LLC ZH Company 310612303. automobiles
2 Coleman, Hall and Lopez ZH Company 162734684. passenger
3 Coleman, Hall and Lopez ZH Company 162734684. cars
4 Coleman, Hall and Lopez ZH Company 162734684. trucks
5 Coleman, Hall and Lopez ZH Company 162734684. vans
6 Coleman, Hall and Lopez ZH Company 162734684. and
7 Coleman, Hall and Lopez ZH Company 162734684. buses
8 Aqua Advancements Sashimi SE Express Oceanus Company 115004667. holding
9 Aqua Advancements Sashimi SE Express Oceanus Company 115004667. firm
10 Aqua Advancements Sashimi SE Express Oceanus Company 115004667. whose
# ℹ 64,192 more rows
We perform a frequency count for the words.
Show code
token_nodes %>%count(word, sort =TRUE) %>%top_n(15) %>%mutate(word =reorder(word, n)) %>%ggplot(aes(x = word, y = n)) +geom_col() +xlab(NULL) +coord_flip() +labs(x ="Count",y ="Unique words",title ="Count of unique words found in product_services field")
The bar chart reveals that the unique words contains some words that may not be useful to use. For instance “a” and “to”. In the word of text mining we call those words stop words. You want to remove these words from your analysis as they are fillers used to compose a sentence. We also will filter common words such as unknown and products, which will not be helpful in distinguishing the companies.
stopwords_removed %>%count(word, sort =TRUE) %>%top_n(15) %>%mutate(word =reorder(word, n)) %>%ggplot(aes(x = word, y = n)) +geom_col() +xlab(NULL) +coord_flip() +labs(x ="Count",y ="Unique words",title ="Count of unique words found in product_services field")
Following the completion of cleaning and visualization, the subsequent stage involves conducting Latent Dirichlet Allocation (LDA). LDA is an iterative algorithm utilized to reveal topics by examining discrete word frequencies. The underlying notion behind LDA is that documents typically pertain to a limited number of topics, and these topics are generally based on a small set of words. However, prior to that, it is necessary to construct a Document Term Matrix. This matrix mathematically represents the occurrence frequency of terms within a collection of documents. In the document-term matrix, each row corresponds to a document in the collection, while each column corresponds to a term.
Show code
dtm<- stopwords_removed %>%count(id, word) %>%cast_dtm(id, word, n) %>%as.matrix()
We use ldatuning to select the number of topic for LDA model.
Show code
result <-FindTopicsNumber( dtm,topics =seq(from =2, to =20, by =1),metrics =c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),method ="Gibbs",control =list(seed =77),mc.cores = 2L,verbose =TRUE)
One straightforward method for analyzing metrics involves identifying extrema. For a more comprehensive understanding, please refer to the relevant papers:
Minimization:
Arun2010 [1]
CaoJuan2009 [2]
Maximization:
Deveaud2014 [3]
Griffiths2004 [4,5]
To facilitate easy analysis of the outcomes, the FindTopicsNumber_plot support function can be utilized.
Show code
FindTopicsNumber_plot(result)
From the graph, it appears that 7 topics are optimal.
Label the company by industry and observe patterns within industry
We will use tidy() to get the probability of the topic for each company. Then we will take a look at each industry in terms of their network and interactions and disregard companies with unknown products and services.
Show code
# Extract probability of the topic for each companylda_topics <-LDA( dtm,k =5,method ="Gibbs",control =list(seed=42) ) %>%tidy(matrix ="gamma")# Summarize company revenuecompany<-mc3_nodes %>%filter(type=='Company') %>%group_by(id) %>%summarise(revenue=sum(revenue_omu))
Show code
# Assign topic with the highest gamma score to the document/companycp_map<-lda_topics %>%group_by(document) %>%summarise(gamma=max(gamma)) cp_map<-cp_map %>%left_join(lda_topics) %>%mutate(topic=recode(topic, '1'="accessories_materials",'2'="food",'3'="equipment",'4'="fish_seafood",'5'="services")) %>%rename("Industry"="topic") %>%select(document, Industry)# Use left join to join back to company revenue infocompany<-company %>%left_join(cp_map, by=c("id"="document"))# Drop those company with unknown industrycompany<- company %>%drop_na(Industry) %>%drop_na(revenue)
One company can be in a few industries at the same time.
Identify the patterns within each industry
Revenue pattern
We can check the distribution of revenue pattern. We need to remove the extreme outliers by using IQR.
Show code
company %>%ggplot(aes(x=Industry, y=revenue))+geom_boxplot()+theme_minimal()
We observed right skewed distribution of revenue in all industries.
Show code
company_rmo %>%ggplot(aes(x = revenue, y = Industry,fill=factor(after_stat(quantile)) )) +stat_density_ridges(geom ="density_ridges_gradient",calc_ecdf =TRUE, quantiles =4,quantile_lines =TRUE) +scale_fill_viridis_d(name ="Quartiles") +theme_ridges()+ggtitle("Distribution of Revenue across different Industries")+theme_minimal()+ylab("Industries")+xlab("Revenue")
We performed statistical analysis to find out if the mean revenue is different for the different industries. From the results, we can see that equipment and food industry have the highest mean revenue. Fishing, accessories/materials and services have lowest mean revenue.
Show code
ggbetweenstats(data = company_rmo,x = Industry, y = revenue,type ="np",mean.ci =TRUE, pairwise.comparisons =TRUE, pairwise.display ="s",p.adjust.method ="fdr",messages =FALSE) +ggtitle("Compare mean revenue among Industries")+xlab("Industries")